DSCI 100 Group 17 Report: Classifying Celestial Bodies from Spectral Characteristics
Group members:
- Aidan Wong
- Ben Tyler
- Tyson Quan
Introduction
Stars are large spheres of hot gas that emit heat and light into space. They are composed mostly of hydrogen, with some helium and other elements. The Sun is the closest star to Earth (NASA, n.d.b).
Galaxies are collections of planets, stars, gas, and dust held together by gravity. They are very large and emit light from the stars and other objects they contain. The Milky Way, where Earth is located, is an example of a galaxy (NASA, n.d.a).
Quasars are the cores of active galaxies and are powered by supermassive black holes. They emit immense amounts of heat and light due to the friction of material being drawn in. The closest quasar to Earth is 3C 273, which can be seen with an 8-inch telescope (Cooper, 2018).
The classification of celestial objects into stars, galaxies, and quasars has been pivotal for understanding Earth's position in space. It has led to key insights, such as the discovery that the Andromeda galaxy is separate from our own, and it continues to be essential for astronomical research (Clarke, 2020).
In this report, we will use data on celestial objects to answer the following question: "Based on its redshift and brightness in different wavelengths of light, what type of celestial object is this?"
Our data set comes from Sloan Digital Sky Survey (SDSS) Data Release 16. It was collected by the Sloan Digital Sky Survey telescope, which measures the spectral characteristics of light from astronomical objects (Fukugita et al., 1996). It contains data on 100,000 astronomical objects, divided into three classes: galaxies, stars, and quasars. The data include redshift, which reflects how quickly an object is moving away from us (Fedesoriano, 2022), and brightness in five bands of light: ultraviolet, green, red, near-infrared, and infrared. We will focus on these six variables to predict the class of astronomical objects.
Import Libraries
import pandas as pd
import altair as alt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.compose import make_column_selector
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Jupyter Notebook settings for printing data frames and plotting graphs
set_config(transform_output="pandas")
alt.data_transformers.disable_max_rows()
# Seed to ensure reproducible report
np.random.seed(1234)
Methods and Results
This section consists of three main parts:
- Loading and Cleaning Data
- Exploratory Data Analysis
- Classification Analysis
Methods
In this section, we explain the methods we used to produce our findings.
First, we used six variables, represented as columns in the data set: u, g, r, i, z, and redshift. The first five are brightness values in different bands of light: ultraviolet, green, red, near-infrared, and infrared (Fukugita et al., 1996). They are measured in magnitudes, which are unitless and reflect photon abundance (SDSS Voyages, 2024a). These magnitudes can help determine object class because quasars, galaxies, and stars can have distinctive colours (SDSS, n.d.a). We also included redshift, which indicates the lengthening of an object's light wavelengths due to the expansion of the universe (SDSS, 2024b). Galaxies and quasars often have higher redshift values than stars, so a high redshift suggests that an object is a galaxy or quasar rather than a star (Crockett, 2021).
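To make these two quantities concrete, here is a minimal sketch. It assumes the conventional logarithmic magnitude scale, on which a difference of 5 magnitudes corresponds to a factor of 100 in flux, and the standard definition of redshift as the fractional change in wavelength. The helper names `flux_ratio` and `redshift` are hypothetical and are not part of the analysis in this report; SDSS catalogue magnitudes follow a slightly modified scale, so treat the numbers as approximate.

```python
def flux_ratio(mag_a: float, mag_b: float) -> float:
    """Return flux_a / flux_b for two magnitudes on the conventional scale.

    Smaller magnitudes mean brighter objects, so the exponent is negative.
    """
    return 10 ** (-0.4 * (mag_a - mag_b))


def redshift(observed_wavelength: float, emitted_wavelength: float) -> float:
    """Return z = (observed - emitted) / emitted for one spectral line."""
    return (observed_wavelength - emitted_wavelength) / emitted_wavelength


# Magnitudes of the first row shown later in this report (u and g bands).
u_mag, g_mag = 18.69254, 17.13867
print(f"g-band flux / u-band flux: {flux_ratio(g_mag, u_mag):.2f}")

# A line emitted at 500 nm and observed at 550 nm has been stretched by 10%.
print(f"redshift: {redshift(550.0, 500.0):.3f}")
```

A redshift near zero, as for the stars in the table, means the observed wavelengths are almost unchanged, while larger positive values indicate more strongly stretched light.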
After gathering the necessary data, we preprocessed it by selecting the relevant columns and renaming them for readability. Once the data set was cleaned, we conducted an exploratory analysis: we examined the data type of each column and assessed the distribution of classes. Because K-nearest neighbours (KNN) classification predicts by majority vote among nearby points, an over-represented class can dominate predictions, so a balanced class distribution was important for accurate results. To identify suitable variables for classification, we visualized the data using density plots, which show how the distribution of each variable differs between classes.
After completing the exploratory analysis, we proceeded with the classification analysis using the six selected variables. As the class distribution in the original dataset was unbalanced, we performed upsampling to create a balanced dataset. Subsequently, we followed the standard procedure for KNN classification. This involved splitting the dataset into training and testing sets. We also created a pipeline containing a KNN model object and a preprocessor to standardize the numerical variables. To determine the optimal parameter k, we conducted 5-fold cross-validation on the training data set, which used the pipeline and tested k values from 2 to 14.
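The split, pipeline, and cross-validation steps described above can be sketched as follows. This is an illustration under stated assumptions, not the report's actual analysis code: it runs on small synthetic data rather than the SDSS data, uses only two of the six predictors, and the 75/25 stratified split and `random_state` values are assumptions that the text does not specify.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the SDSS data): three classes around different centres.
rng = np.random.default_rng(1234)
centres = {"GALAXY": (0.0, 0.0), "STAR": (3.0, 0.0), "QSO": (0.0, 3.0)}
toy = pd.concat(
    [
        pd.DataFrame({
            "Redshift": rng.normal(cx, 1.0, 100),
            "Green": rng.normal(cy, 1.0, 100),
            "Class": label,
        })
        for label, (cx, cy) in centres.items()
    ],
    ignore_index=True,
)

# Split into training and testing sets (proportion and stratification are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    toy[["Redshift", "Green"]], toy["Class"],
    train_size=0.75, stratify=toy["Class"], random_state=1234,
)

# Pipeline: standardize the numeric predictors, then fit a KNN classifier.
preprocessor = make_column_transformer((StandardScaler(), ["Redshift", "Green"]))
pipeline = make_pipeline(preprocessor, KNeighborsClassifier())

# 5-fold cross-validation over k = 2, ..., 14 on the training set only.
param_grid = {"kneighborsclassifier__n_neighbors": list(range(2, 15))}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

cv_results = pd.DataFrame(grid.cv_results_)[
    ["param_kneighborsclassifier__n_neighbors", "mean_test_score"]
]
best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
print(cv_results)
print("Best k:", best_k)
```

Placing the scaler inside the pipeline means it is refitted on each training fold, so the validation fold never influences the scaling.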
To visualize the results of the cross-validation, we created a plot of k values against estimated accuracy, which aided in selecting the appropriate k value. Next, we evaluated the performance of our classification model using scoring functions and cross-tabulation analysis to gain a comprehensive understanding of the model's results. Additionally, we created a pairplot to explore the relationships between each parameter used in the classification model. Based on these findings, we repeated the same procedure for a new set of chosen variables.
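As a small illustration of the cross-tabulation step, the sketch below compares hand-written actual and predicted labels. The labels are made up for the example and are not results from our model.

```python
import pandas as pd

# Hypothetical labels for eight objects (illustrative only).
actual = pd.Series(
    ["GALAXY", "GALAXY", "STAR", "STAR", "QSO", "QSO", "QSO", "GALAXY"],
    name="Actual",
)
predicted = pd.Series(
    ["GALAXY", "STAR", "STAR", "STAR", "QSO", "GALAXY", "QSO", "GALAXY"],
    name="Predicted",
)

# Rows are the true classes, columns are the predicted classes.
confusion = pd.crosstab(actual, predicted)
print(confusion)

# Accuracy: share of objects whose predicted class matches the true class.
accuracy = (actual == predicted).mean()

# Recall for quasars: correctly predicted quasars out of all true quasars.
qso_recall = confusion.loc["QSO", "QSO"] / confusion.loc["QSO"].sum()
print(f"accuracy = {accuracy:.2f}, QSO recall = {qso_recall:.2f}")
```

Reading the table row by row shows which classes are confused with each other, which a single accuracy score cannot reveal.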
Upon completing both models, we reached conclusions based on our findings, which are presented in the following section.
1. Loading and Cleaning Data
# Load the data file from the web (Pandas, 2019).
# The file ID is extracted from the Google Drive sharing link to build a direct-download link.
url = "https://drive.google.com/file/d/1LM-kB1xP90O9RBY5yjRP1mET_BKOOhxC/view?usp=sharing"
url = "https://drive.google.com/uc?id=" + url.split("/")[-2]
star_data = pd.read_csv(url)
star_data.head()
| | objid | ra | dec | u | g | r | i | z | run | rerun | camcol | field | specobjid | class | redshift | plate | mjd | fiberid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1237666301628060000 | 47.372545 | 0.820621 | 18.69254 | 17.13867 | 16.55555 | 16.34662 | 16.17639 | 4849 | 301 | 5 | 771 | 8168632633242440000 | STAR | 0.000115 | 7255 | 56597 | 832 |
| 1 | 1237673706652430000 | 116.303083 | 42.455980 | 18.47633 | 17.30546 | 17.24116 | 17.32780 | 17.37114 | 6573 | 301 | 6 | 220 | 9333948945297330000 | STAR | -0.000093 | 8290 | 57364 | 868 |
| 2 | 1237671126974140000 | 172.756623 | -8.785698 | 16.47714 | 15.31072 | 15.55971 | 15.72207 | 15.82471 | 5973 | 301 | 1 | 13 | 3221211255238850000 | STAR | 0.000165 | 2861 | 54583 | 42 |
| 3 | 1237665441518260000 | 201.224207 | 28.771290 | 18.63561 | 16.88346 | 16.09825 | 15.70987 | 15.43491 | 4649 | 301 | 3 | 121 | 2254061292459420000 | GALAXY | 0.058155 | 2002 | 53471 | 35 |
| 4 | 1237665441522840000 | 212.817222 | 26.625225 | 18.88325 | 17.87948 | 17.47037 | 17.17441 | 17.05235 | 4649 | 301 | 3 | 191 | 2390305906828010000 | GALAXY | 0.072210 | 2123 | 53793 | 74 |
# Cleaning data
# Keep only the relevant columns and give them more readable names
star_filtered = (
    star_data.loc[:, ["u", "g", "r", "i", "z", "redshift", "class"]]
    .rename(columns={
        "u": "Ultraviolet",
        "g": "Green",
        "r": "Red",
        "i": "Near Infrared",
        "z": "Infrared",
        "redshift": "Redshift",
        "class": "Class"
    })
)
star_filtered.head()
| | Ultraviolet | Green | Red | Near Infrared | Infrared | Redshift | Class |
|---|---|---|---|---|---|---|---|
| 0 | 18.69254 | 17.13867 | 16.55555 | 16.34662 | 16.17639 | 0.000115 | STAR |
| 1 | 18.47633 | 17.30546 | 17.24116 | 17.32780 | 17.37114 | -0.000093 | STAR |
| 2 | 16.47714 | 15.31072 | 15.55971 | 15.72207 | 15.82471 | 0.000165 | STAR |
| 3 | 18.63561 | 16.88346 | 16.09825 | 15.70987 | 15.43491 | 0.058155 | GALAXY |
| 4 | 18.88325 | 17.87948 | 17.47037 | 17.17441 | 17.05235 | 0.072210 | GALAXY |
2. Exploratory Data Analysis
This section summarizes the aspects of the data set that are relevant to the planned analysis.
# General understanding of the dataset
star_filtered.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   Ultraviolet    100000 non-null  float64
 1   Green          100000 non-null  float64
 2   Red            100000 non-null  float64
 3   Near Infrared  100000 non-null  float64
 4   Infrared       100000 non-null  float64
 5   Redshift       100000 non-null  float64
 6   Class          100000 non-null  object
dtypes: float64(6), object(1)
memory usage: 5.3+ MB
# Understand the proportion of classes in the dataset to determine whether we have to upsample the data or not
star_filtered["Class"].value_counts(normalize=True)
Class
GALAXY    0.51323
STAR      0.38096
QSO       0.10581
Name: proportion, dtype: float64
From the output above, we can see the data type of each column and the proportion of each class. All six predictors are numeric with no missing values, and the classes are unbalanced: about 51% galaxies, 38% stars, and 11% quasars.
Because of this imbalance, we should upsample the data set so that each class is classified fairly.
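One way to upsample is sketched below, assuming `sklearn.utils.resample` is used to draw rows with replacement from each smaller class until it matches the largest class. The toy data frame and the variable names are hypothetical, and this is not necessarily the exact procedure used later in this report.

```python
import pandas as pd
from sklearn.utils import resample

# Toy unbalanced data (illustrative only): 5 galaxies, 3 stars, 2 quasars.
toy = pd.DataFrame({
    "Redshift": [0.05, 0.07, 0.06, 0.08, 0.09, 0.0001, 0.0002, -0.0001, 1.2, 2.1],
    "Class": ["GALAXY"] * 5 + ["STAR"] * 3 + ["QSO"] * 2,
})

majority_size = toy["Class"].value_counts().max()

# Resample each smaller class with replacement up to the size of the largest class.
parts = []
for label, group in toy.groupby("Class"):
    if len(group) < majority_size:
        group = resample(
            group, replace=True, n_samples=majority_size, random_state=1234
        )
    parts.append(group)

balanced = pd.concat(parts, ignore_index=True)
print(balanced["Class"].value_counts())
```

Because upsampling duplicates rows, applying it before the train/test split can place copies of the same row in both sets; applying it to the training set only avoids that.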
This section visualizes the data set to support the planned analysis.
# Standardizing the data for plotting in the below sections
preprocessor_keep_all = make_column_transformer(
    (StandardScaler(), ["Ultraviolet", "Green", "Red", "Near Infrared", "Infrared", "Redshift"]),
    remainder="passthrough",
    verbose_feature_names_out=False
)
# Fit the preprocessor to compute the mean and standard deviation of each column
preprocessor_keep_all.fit(star_filtered)
# Transform the data to apply the standardization
star_scaled = preprocessor_keep_all.transform(star_filtered)
star_scaled.head()
#star_scaled.nlargest(5, "Redshift")
| | Ultraviolet | Green | Red | Near Infrared | Infrared | Redshift | Class |
|---|---|---|---|---|---|---|---|
| 0 | 0.065633 | -0.272293 | -0.287759 | -0.230598 | -0.226791 | -0.389669 | STAR |
| 1 | -0.194147 | -0.103121 | 0.317192 | 0.580613 | 0.705310 | -0.390143 | STAR |
| 2 | -2.596213 | -2.126356 | -1.166443 | -0.746957 | -0.501159 | -0.389555 | STAR |
| 3 | -0.002769 | -0.531149 | -0.691260 | -0.757044 | -0.805267 | -0.257025 | GALAXY |
| 4 | 0.294775 | 0.479099 | 0.519436 | 0.453794 | 0.456601 | -0.224904 | GALAXY |
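To show what the standardization does, the sketch below reproduces `StandardScaler` by hand on five values: subtract the column mean, then divide by the column standard deviation (the population version, with `ddof=0`). The five values are the ultraviolet magnitudes of the first rows of the table; because the scaled table above was fitted on all 100,000 rows, the numbers produced here will not match it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Ultraviolet magnitudes of the first five rows, as a single column.
values = np.array([[18.69254], [18.47633], [16.47714], [18.63561], [18.88325]])

# Manual standardization: z = (x - mean) / std, using the population std (ddof=0).
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# The same calculation with scikit-learn.
scaled = StandardScaler().fit_transform(values)

print(np.allclose(manual, scaled))
print(np.asarray(scaled).ravel())
```

After scaling, each column has mean 0 and standard deviation 1, so no single variable dominates the distance calculations that KNN relies on.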
In the code below, we draw density plots, which are effective for comparing multiple distributions.
With these plots, we would like to identify any variables whose distributions differ between the classes (star, galaxy, or quasar).
# Plot the distribution of each characteristic's values, split by class.
star_exploration_plot = alt.Chart(
    star_scaled.melt(
        id_vars=["Class"],
        var_name="Characteristics",
        value_name="Values",
    )
).transform_density(
    "Values",
    groupby=["Class", "Characteristics"],
    as_=["Values", "Density"]
).mark_area(opacity=0.6).encode(
    x=alt.X("Values").scale(base=10),
    y=alt.Y("Density:Q", title="Density"),
    color="Class:N"
).properties(
    width=150,
    height=150
).facet(
    alt.Facet(
        "Characteristics",
        sort=star_scaled.columns[:-1].tolist()
    ),
    columns=6
).resolve_scale(
    # Use independent x- and y-scales so that each facet is drawn on its own axis range.
    # The values shown are standardized, so only the shape of each distribution matters.
    x="independent",
    y="independent"
)
star_exploration_plot
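The chart above relies on `melt` to reshape the data from one column per characteristic into one row per (object, characteristic) pair, which is what lets the plot be grouped and faceted by characteristic. A minimal sketch on a hypothetical two-row data frame shows the reshaping:

```python
import pandas as pd

# Hypothetical standardized values for two objects (illustrative only).
wide = pd.DataFrame({
    "Green": [-0.27, -0.53],
    "Redshift": [-0.39, -0.26],
    "Class": ["STAR", "GALAXY"],
})

# Keep "Class" as an identifier; stack the other columns into name/value pairs.
long = wide.melt(
    id_vars=["Class"],
    var_name="Characteristics",
    value_name="Values",
)
print(long)
```

Each original row becomes one row per characteristic, so two objects with two characteristics yield four rows with the columns `Class`, `Characteristics`, and `Values`.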